Model-based characterization of the equilibrium dynamics of transcription initiation and promoter-proximal pausing in human cells

Abstract In metazoans, both transcription initiation and the escape of RNA polymerase (RNAP) from promoter-proximal pausing are key rate-limiting steps in gene expression. These processes play out at physically proximal sites on the DNA template and appear to influence one another through steric interactions. Here, we examine the dynamics of these processes using a combination of statistical modeling, simulation, and analysis of real nascent RNA sequencing data. We develop a simple probabilistic model that jointly describes the kinetics of transcription initiation, pause-escape, and elongation, and the generation of nascent RNA sequencing read counts under steady-state conditions. We then extend this initial model to allow for variability across cells in promoter-proximal pause site locations and steric hindrance of transcription initiation from paused RNAPs. In an extensive series of simulations, we show that this model enables accurate estimation of initiation and pause-escape rates. Furthermore, we show by simulation and analysis of real data that pause-escape is often strongly rate-limiting and that steric hindrance can dramatically reduce initiation rates. Our modeling framework is applicable to a variety of inference problems, and our software for estimation and simulation is freely available.


Introduction
Across all branches of life, homeostatic control of gene expression arises from a series of dynamic equilibria among competing processes.For example, cellular RNA concentrations reflect an equilibrium between RNA production and decay; RNA polymerase ( RNAP ) occupancy reflects an equilibrium among transcription initiation, elongation, and termination; and protein concentrations reflect an equilibrium between protein synthesis and degradation.When such concentrations change, say, across conditions, cell types, or developmental stages, it is typically because the relative rates of competing processes are altered and a new equilibrium is reached, rather than because any individual process is wholly enabled or disabled.
For decades, research on the regulation of eukaryotic gene expression focused on RNAP recruitment and transcription initiation, which were thought to be the rate-limiting steps under most circumstances ( 1 ) .In recent years, however, it has become clear that many downstream steps in transcription can be regulated.One striking observation, first noted at particular loci in Drosophila and mammals (2)(3)(4)(5)(6) , is that RNAPs are frequently held in a 'paused' position about 18-60 nt downstream of the transcription start site, before escaping into productive elongation (7)(8)(9)(10)(11)(12)(13)(14)(15)(16) .Later studies showed that such promoter-proximal pausing is widespread across metazoans ( 14 ) .Several lines of evidence indicate that the escape from promoter-proximal pausing is frequently a regulated step in gene expression ( 17 ) .It has been argued that regulation at the pause-escape stage may be particularly advantageous when a rapid transcriptional response is required, as in heat shock or other stimulus-controlled pathways ( 11 ,17-22 ) .
It has been noted that promoter-proximal pausing not only has a direct effect on the rate at which RNAPs proceed into productive elongation, but may also have an indirect influence on the rate of productive initiation, owing to steric hindrance between paused and initiating RNAPs.The key observation is that the physical space immediately downstream of the transcription start site ( TSS ) is limited: each RNAP has a 'footprint' of ∼33-35 nt ( 23 ,24 ) and the pause site is typically no more than 60 nt downstream of the TSS.Thus, depending on precisely how much space is required between two adjacent RNAPs, there is typically room for only one, or perhaps two or three, RNAPs in the pause region before new initiation events begin to be blocked.Indeed, computer simulations of the movement of RNAPs along the DNA template have suggested that such steric hindrance could substantially reduce initiation rates ( 24 ) .More recently, genome-wide studies of human ( 25 ,26 ) and Drosophila ( 27 ) cells found strong evidence that initiation rates were restricted by pause-escape rates at many genes.Thus, it appears that rates of productive initiation are often governed by a dynamic equilibrium between transcription initiation and pause-escape.
In recent years, two types of approaches have dominated in the study of transcriptional dynamics: ( 1 ) highresolution imaging approaches based on fluorescence in situ hybridization, photobleaching, or electron micrography in single cells ( e.g.(28)(29)(30)(31)(32) ) ; and ( 2 ) genomic approaches based on nascent RNA sequencing ( NRS ) or chromatin immunoprecipitation ( ChIP ) and sequencing across populations of cells ( e.g. ( 17 , 25 , 27 ) ) .The imaging approaches allow for more direct characterization of the dynamics of individual RNAPs, but as yet, they cannot be carried out at scale; instead, they are typically applied to one or a few loci.The genomic approaches are more indirect, requiring tagging and pull-down of sequences such as newly synthesized RNA or RNAP-associated DNA.In addition, because they produce summaries for large populations of cells, they typically require fairly complex statistical analyses for interpretation.Nevertheless, these methods have the crucial advantage of being applicable at genomewide scale for modest cost, by exploiting the many recent advances in genome-sequencing and related technologies.
For these reasons, we focus in this article on the study of transcriptional dynamics using genomic data.We focus in particular on the use of data from the 'RO-seq' family of NRS protocols-including GRO-seq ( 9 ) , PRO-seq ( 15 ,33 ) and ChRO-seq ( 34 ) -which specifically detect active RNAPs in transcriptionally-competent complexes ( 35 ) .These protocols have matured dramatically in recent years, with major improvements in resolution, background signal, ease-of-use and cost.They can be thought of as producing 'snapshots' of the positions of engaged RNAPs across a population of cells either at steady-state or in a time course after a stimulus is applied.We develop a general statistical modeling framework that allows interpretation of these snapshots in a manner that reveals relative rates of transcriptional initiation, promoterproximal pause escape, and elongation, as well as interrelationships among these processes.
Notably, we focus on modeling transcriptional dynamics under equilibrium conditions, which, in comparison to studies of time courses following transcriptional stimuli, allows us to examine larger sets of genes and avoid the potential off-target effects of commonly used drugs such as triptolide.The focus on steady-state conditions also leads to relatively simple and interpretable mathematical results.Importantly, our methods are applicable not only to newly produced NRS data sets, but to the thousands of sequenced samples that are already publicly available in databases such as the Gene Expression Omnibus ( 36 ) .
We apply these new methods to both simulated and real NRS data, and refine our model to account for variable pause sites across cells and steric hindrance of transcription initiation from paused RNAP.To support our experiments, we have developed a fast and flexible simulator for NRS data, called SimPol, which is freely available along with our software for parameter estimation ( see 'Code Availability' ) .Altogether, we find strong quantitative evidence that promoterproximal pausing has major importance in the dynamics of transcription in human cells, that RNAP occupancy in the pause region tends to be high at many genes, and that initiation rates are frequently limited by paused RNAPs.We discuss various implications of these findings in detail.

A simple probabilistic model for transcription initiation, promoter-proximal pausing, and elongation
Our initial model consists of two layers: a continuous-time Markov model for the movement of individual RNA polymerases ( RNAPs ) along a transcription unit ( TU ) , and a conditionally independent generating process for the read counts at each nucleotide site ( Figure 1 A, B, see also bioRxiv: https: // doi.org/ 10.1101/ 2021.01.12.426408 ).Together, these components produce a full generative model for NRS read counts along the TU, permitting inference of transcriptional rate parameters from the raw data.There is a long history of application of similar quantitative models to the processes of transcription and translation (37)(38)(39)(40)(41)(42)(43), but to our knowledge, they have not been used to study promoter-proximal pausing or its feedback on transcription initiation rates.
Our Markov model consists of a state Z i for each nucleotide position i ∈ {1, …, N } of the RNAP plus an additional state, Z 0 , that represents free RNAPs.It is parameterized by a transcription-initiation rate α, a rate of promoter-proximal pause escape β, and a termination rate γ (see Supplementary Table S1 for a summary of modeling notation).In addition, it includes an elongation rate ζ i for each position i , which can either be assumed constant across sites (with ζ i = ζ as throughout this manuscript) or allowed to vary.For mathematical convenience (see below), the parameters α, β and γ are multiplied by the corresponding ζ i parameters.In this manuscript, we focus on inference of α and β and largely ignore γ, because the 3 ends of TUs are difficult to characterize.
In this version of the model, promoter-proximal pausing is assumed to occur at a fixed position k along the DNA template, where k can be pre-estimated from the data.This assumption will be relaxed in subsequent sections.We further assume that the read counts at each position i are Poissondistributed with mean μ i , where μ i is a scaled version of the probability density of Z i (and, hence, the RNAP density at nucleotide i ) that reflects the initiation and elongation rates, as well as the sequencing depth.Finally, in order to make use of a time-homogeneous Markov chain, we assume that RNAPs are sufficiently sparse along the DNA template that collisions between them are rare-another assumption that will be relaxed later.Notably, the model can be applied either in a nonequilibrium setting based on time-course data or to a single data set at steady state.We focus in this manuscript on the more limited steady-state case, because it is more mathematically convenient and interpretable, yet, as we show below, still allows for a quantitative characterization of the dynamic equilibrium between transcription initiation and pause-escape.In order to ensure identifiability of initiation and pause-escape rates at steady state, we assume that initiating RNAPs remain on the DNA template for its entire length, with negligible rates of premature termination (see Discussion).
This simple version of the model results in convenient, closed-form maximum-likelihood estimators for β and a parameter closely related to α.Because the initiation rate α is confounded at steady-state with the elongation rate ζ and the sequencing depth λ, we instead work with a compound parameter χ = λ α ζ representing the read-depth-scaled ratio of the initiation rate to the elongation rate.It turns out that the MLE for χ is simply equal to the average read depth along the gene body and the MLE for β is given by the ratio of the average read depth in the gene body to that in the pause peak (see below).The estimates of both α and β can be considered relative to ζ, based on the multiplicative parameterization of the model.Notably, similar average-read-depth estimators have been widely used in the analysis of nascent RNA-sequencing data, typically with more heuristic justifications.For example, the inverse of the estimator for β is commonly known as the 'pausing index' ( 14 ,44 ), and the estimator for χ is often used as a general measure of transcription output ( 19 ).In our case, these estimators emerge as MLEs under a generative probabilistic model, making it possible to characterize the dynamics of transcription from raw NRS read counts at steady state.

Continuous-time Markov model
The probabilistic model (Figure 1 A, B) consists of a continuous-time Markov model that describes the stochastic movement of individual RNAPs along the DNA template, and a conditional generating process by which read counts arise independently at each site in proportion to RNAP occupancy (defined below).The Markov model consists of N + 1 states corresponding to the N possible nucleotide positions of the active site of an RNAP as it moves along an Nnucleotide DNA template, plus an additional state (labeled 0) that abstractly represents 'free' RNAPs, not currently engaged in transcription and available for new initiation events.Each state i in the Markov model corresponds to a binary random variable Z i , indicating whether the RNAP is ( Z i = 1) or is not ( Z i = 0) at position i at a particular time t (Figure 1 B).In our setting, the use of this time-homogeneous model depends on two key assumptions: (i) that collisions between RNAPs are rare, allowing the movement of each RNAP to be considered independently of the others and (ii) that premature termination of transcription is sufficiently rare that each RNAP can be assumed to traverse the entire DNA template if it is given enough time (see Discussion for limitations).
The model distinguishes between two segments of each transcription unit: (i) the first k nucleotides, known as the pause peak , where RNAP tends to accumulate owing to promoter-proximal pausing (typically k ≈ 50) ( 9 ); and (ii) the subsequent N − k nucleotides, where RNAP tends to be relatively unimpeded, which is typically referred to as the gene body .Movement of the RNAP is defined by four rate parameters: an initiation rate α (from state 0 to state 1), a pause-escape rate β (from state k to state k + 1), a termination rate γ (from state N to state 0), and a constant pernucleotide elongation rate ζ (for all other allowable transitions).Because the states must be visited in a sequence, the infinitesimal generator matrix for the Markov chain Q = { q i j } has a simple form, with positive terms only on the diagonal q ij such that j = i + 1, negative terms on the main diagonal, and zeroes elsewhere.For mathematical convenience, we assume that the initiation, pause-escape, and termination steps are coupled with single-nucleotide elongation steps and occur at rates ζα, ζβ and ζγ, respectively.As a result, as long as ζ is the same across nucleotides, it can be considered a scaling factor that applies equally to all steps in the process.Specifically, all other i, j such that j = i + 1 0 all other i, j such that i = j − i : i = j q i j i = j, (1) where element q ij indicates the instantaneous rate at which an RNAP transitions from state i to state j , and by convention, the values along the main diagonal are set such that the rows sum to zero.

Stationary distribution
This continuous-time Markov model allows for calculation of the probability of transitioning from any state i to any other state j over a period of time t ≥ 0, but in this article we are interested in steady-state conditions and therefore focus on the limiting distribution for ending states as t → ∞ .This stationary distribution, denoted π, is invariant to ζ and can be found easily by solving the equation πQ = 0 .Because state 0 is simply an abstraction to allow for the recirculation of RNAP, we omit it and describe the stationary distribution conditional on RNAP occupancy along the DNA template.This conditional stationary distribution can be expressed as π = (π 1 , . . ., π N ) such that, This distribution has an intuitive interpretation.First, it is natural that, conditional on RNAP occupancy, the steadystate distribution is invariant to both α (which defines the rate at which occupancy is initiated but has no effect thereafter) and ζ (which defines the 'flow' along the DNA template but does not favor one nucleotide position over another).In addition, as a result of local slowdowns in elongation, π i is elevated relative to the gene body by factors of 1  β and 1 γ at the pause peak and termination peak, respectively.Notice that both peaks take the form of 'spikes' at single nucleotide positions under this model; in later sections we will generalize the model to allow for a broader pause peak.
When comparing different transcription units, we have to allow for differences in TU-specific initiation and elongation rates.In particular, in addition to obeying equation 2 , the relative RNAP densities at TU j will be proportional to where α j and ζ j are the TU's initiation and elongation rates, respectively.Furthermore, as detailed below, estimation of these rates is confounded by the sequencing depth, λ.Because these parameters are not identifiable at steady state, we represent them by the compound parameter χ j = λα j ζ j .In addition, the termination peak tends to be difficult to characterize with real nascent RNA sequencing data, owing to transcriptional run-on, poorly characterized 3 ends of genes, and other factors.Therefore, from this point on, we omit the parameter γ and assume N is defined such that ambiguities at the 3 ends of TUs are excluded.
With these assumptions, when comparing the RNAP densities at various sites i along various TUs j , we expect them to be proportional to, (3)

Generative model for sequence data
To allow for the fact that the RNAP positions are only indirectly observed through the sequencing process, we add a second layer to the model that describes the probabilistic process by which sequencing read counts are generated conditional on an underlying density of RNAP at each nucleotide, as defined by the continuous-time Markov model.In this way, we obtain a full generative model for the observed sequence data that is defined by the model parameters, enabling inference of all parameters from the data.Because we have freedom in how to set the read-depth scaling parameter λ (see below), we simply take μ ij (as defined in equation 3 ) to be the expected read depth at position i of TU j .We then assume that the read count X i , j is Poisson-distributed with this mean.It is possible to use other generating distributions to allow for overdispersion of the read counts (see bioRxiv: https:// doi.org/ 10.1101/ 2021.01.12.426408 ), but the Poisson assumption seems to be adequate in our case and it is particularly convenient for parameter inference.
With these assumptions, let the data for a single TU be denoted X = (X 1 , . . ., X N ) , where X i represents the number of sequencing reads having their 3 end aligned to position i .(We omit the j index in this case to simplify the notation.)Assuming conditionally independent Poisson distributions at each site, the steady-state log likelihood function for a single TU is given by, where s = N i =1 X i is a statistic equal to the sum of all read counts and Z = N i =1 X i ! is a normalization term that does not depend on the model parameters and can be ignored during optimization.

Inference at steady state
Under this likelihood function (equation 4 ), the maximumlikelihood estimators for χ and β have simple closed-form solutions: Notice that the estimator for χ is simply the average read depth, excluding the pause peak, and the estimator for β is the ratio of that same average read depth to the read depth in the pause peak.These are estimators that have been widely used in the analysis of nascent RNA sequencing data, with more heuristic justifications ( 14 , 19 , 44 ).
In practice, it tends to be better to avoid the complex signal in the pause region in estimating χ and instead estimate it from a downstream portion of the gene body.We define s = j+ M −1 i = j X i , where M is the length of the interval considered, and estimate χ and β as, For the remainder of the paper, we assume the use of these simpler, more robust estimators for χ and β.

Allowing for variation in the pause site
In real samples, the pause site tends to vary across cells, leading to a broad pause peak in nascent RNA sequencing data.We address this complication by allowing the location of the pause peak, k , to vary between a k min and a k max according to an appropriate distribution, and then assuming each read count X k within this range reflects a mixture of cells that do and do not have their pause site at position k .
This log likelihood function does not have a closed-form solution but it is straightforward to maximize by expectation maximization.The complete-data log likelihood function, with known values of Y k , can be expressed in terms of compact sufficient statistics (analogous to equation 4 ) as, where s = i X i and t = k Y k .The expected value of this function, averaging over the latent variables Y k -the quantity to maximize in EM-is simply, where Y k and t = k Y k denote the posterior expected values of Y k and t , respectively.For simplicity, assume that χ is pre-estimated for a portion of the gene body downstream of k max using equation 7 .If in addition the values of f k are fixed, then β can be simply estimated as, The values Y k can be computed by observing that, because Thus, an EM algorithm can be implemented by iteratively applying equations 11 and 13 , in turn, until convergence.Furthermore, to estimate the distribution of pause sites from the data at each TU, we assume the f k values have a truncated Gaussian distribution with mean μ and variance σ 2 , where the explicit normalization constant Z is needed because the distribution is applied to a bounded interval and is defined at integer values only.
In this case, the EM updates for μ and σ 2 are simply, where, Notice that all of these quantities are easily computed from the Y k values together with f k and X k .

Approximate relationship between β and the pausing index I P
The model above requires iterative estimation, but an approximate closed-form expression for β in terms of the pausing index I P can be obtained by noting that the excess reads in the pause region, t , can be estimated reasonably well by multiplying the difference between the average read depths in the pause region and gene body by the length of the pause region, L = k max − k min + 1: where ˆ χ P is the average read depth in the pause region and χ is the pausing index as computed by averaging across the pause region.Therefore, Thus, an interpretable pause-escape rate parameter can be estimated approximately from the pausing index I P .This equation provides a physical interpretation for I P and also explains why our naive initial estimation strategy tended to overestimate β by a factor of approximately L .This strategy, however, does not allow estimation of the mean or variance of the pause site at each gene.

Allowing for steric hindrance of initiation
To accommodate steric hindrance of initiation at steady state, we introduce a distinction between a potential rate of initiation in the absence of occlusion of the initiation site, α, and the effective rate of initiation after a portion of initiation events are blocked by an existing RNAP molecule, which we denote ω .We assume ω = (1 − φ) α, where φ is the probability of that the 'landing pad' required for a new initiation event is already occupied by an RNAP.Thus, ω ≤ α.Notice that any estimation of initiation rates based on the density of RNAPs in the gene-body will be representative of ω , not α; a correction may be required to estimate α accurately.
We first assume that the 'footprint' of an engaged polymerase, , is sufficiently large that at most one RNAP can be present in this region at a time.(We relax this assumption below.)We further assume that elongation up to position k occurs much faster than the initiation rate αζ or the pauseescape rate βζ, so that the dynamics of elongation through the pause peak can be ignored.In this case, occupancy of the landing pad can be described using a simple two-state continuoustime Markov model (Figure 2 A).Here, either the landing pad is unoccupied (state 0) and is therefore available for new initiation events, which occur at rate αζ; or the landing pad is already occupied by an RNAP (state 1) and no new initiation events can occur until that RNAP escapes from the pause site, which occurs at rate βζ.Thus, at steady state, the landing-pad occupancy φ is simply given by the stationary distribution of the occupied state, and the effective initiation rate, allowing for steric hindrance, is given by, Notice that these equations also imply that ω = φβ, meaning that, at steady state, the effective initiation rate ω must always be less than or equal to the pause-escape rate β, with ω approaching β as the landing-pad occupancy φ approaches unity.Therefore, if one estimates an effective initiation rate ˆ ω and a pause-escape rate ˆ β from the data, as described above, then one can obtain estimates of φ and α as follows, Notice that the estimator for φ will be proportional to the read-counts in the pause peak (see, e.g., equation 7 ).

Steric hindrance with multiple RNAPs
As it turns out, the assumption of ≤1 RNAPs per pause region is often too restrictive, and the presence of more than one RNAP in this region can have a substantial impact on the landing-pad occupancy φ.In this section, we generalize the model for steric hindrance to allow for any number r of RNAPs in the pause region, focusing in particular on the case of r ≤ 3, which we expect to cover essentially all plausible scenarios in human cells.Let s p be the minimum center-to-center spacing, in nucleotides, between adjacent RNAPs on the DNA template.As noted in the main text, structural data suggests s p is at least 33 nt but more plausibly s p ≈ 50 nt ( 23 , 24 , 45 , 46 ); we also consider the case of s p = 70 nt for comparison.We further assume that a new initiation event can successfully occur if, and only if, the previous RNAP has advanced to a position i > s p (in other words, the 'landing pad' for new initiation events has size = s p ). Consequently, the maximum possible number of RNAPs in the pause region is r and so on (see Figure 2 B).In general, r = k s p .In addition, we assume a probability mass function f k for the pause site k across cells, with cumulative distribution function Let q 1 be the density associated with r = 1; that is, q 1 = F ( k = s p ).Similarly, q 2 is the density associated The pause region must be either unoccupied (state 0) or already occupied by another RNAP (state 1).Transitions from state 0 to state 1 occur at the (unimpeded) initiation rate, αζ, and transitions from state 1 to state 0 occur at the pause-escape rate, βζ.The stationary frequency of state 1 defines the landing-pad occupancy φ and is given by α α+ β .( B ) Illustration showing a hypothetical distribution of pause sites k and its implications for the number of RNAPs that can simultaneously occupy the pause region.When k ≤ s p , where s p is the minimum center-to-center spacing between adjacent RNAPs, only one RNAP is possible (Case 1 in the text); when s p < k ≤ 2 s p , up to two are possible (Case 2); and when 2 s p < k ≤ 3 s p , up to three are possible (Case 3).Notice that the portion of the density corresponding to each Case r is given by q r .( C ) Generalization of Markov model to accommodate up to two RNAPs in the pause region (Case 2).( D ) Further generalization to accommodate up to three RNAPs (Case 3).The equation for φ can be generalized to account for these cases (see text).
Now, in each cell, k has a single value and therefore one of a series of mutually exclusive cases must apply.Let us denote by Case r (for r ∈ {1, 2, 3, …}) that the maximum possible number of RNAPs is r .Thus, in Case 1, r = 1 and k ≤ s p ; in Case 2, r = 2 and s p < k ≤ 2 s p ; and so on.Given f k , we know that Case r occurs with probability q r .Therefore, we can calculate φ as a mixture of case-specific landing-pad probabilities, φ = ∞ r =1 q r φ r , where φ r is the probability that the landing pad (first s p nucleotides) is occupied in case r .In practice, we approximate this quantity as φ ≈ R r =1 q r φ r where R is the maximum plausible value of r (here, R = 3) and, where the last term, q R , is 'rounded up' to account for the remaining tail of the distribution.The case of r = 1 has already been described in the previous section.It can be captured by a two-state model, where the landing-pad is either unoccupied (state 0) or occupied by a single RNAP (state 1; Figure 2 A).Therefore, It turns out that the cases of larger values of r follow naturally through the addition of more states.For example, when r = 2, the landing pad is either unoccupied (state 0), occupied by one RNAP (state 1), or occupied by two RNAPs (state 2).Assuming no two events can occur simultaneously, state 2 can be reached only when a new initiation event occurs while state 1 is occupied (at rate αζ), and a pause-release event in state 2 causes a return to state 1 (at rate βζ).The result is a chain of states as shown in Figure 2 C. In addition, under the assumption that elongation through the pause peak is instantaneous, the landing pad is occupied if, and only if, state 2 is occupied.Thus, φ 2 is given by the stationary frequency of state 2 in this model, which can be shown to be, Similarly, the case of r = 3 can be addressed by extending the chain further with a state 3, and setting φ 3 equal to the stationary frequency of that state (Figure 2 D), Therefore, assuming R = 3, we can estimate φ as, allowing for the possibilities of one, two, or three RNAPs in the pause region of each cell.As s p grows larger and / or f k shifts toward the TSS, the fraction q 1 approaches 1, and Equation ( 25 ) approaches Equation ( 18 ).As it turns out, substitution of ω / (1 − φ) in place of α in this more general expression for φ leads to a complex polynomial that cannot be easily solved for ω .Instead, we solve this equation numerically to obtain φ and α from ω and β (analogous to equation 20 ).

Fitting the steric-hindrance model to data
The multi-RNAP steric hindrance model can be combined with the model for variable pause sites across cells and fitted to the data by a relatively straightforward extension of the EM algorithm described above.As in that case, we assume that the unscaled initiation rate, χ, is pre-estimated from data in the gene body.In addition, however, this case requires preestimation of the scale-factor λ and the elongation rate ζ, so that a scaled estimate of the effective initiation rate, ω = χζ λ , can be obtained from each estimate of χ.In practice, we obtain these scale factors by calibrating with respect to other studies (see 'Calibrating the initiation rate,' below).
In addition, we must address the problem that the landingpad occupancy φ is constrained to fall between 0 and 1, but the relationships above do not enforce such a constraint.For example, in the single-RNAP case, where φ = ω β (equation 20 ), φ will be undefined whenever ω > β.A similar (but more complicated) relationship holds in the multi-RNAP case.
To address this problem, we take a Bayesian approach and assume a (weakly) informative prior distribution for φ.This strategy not only restricts φ to the allowable range but has the benefit of regularizing the model when information in the data about φ is weak.With the assumption of a Beta( φ | a , b ) prior for φ, with shape parameters a and b (we assume a = b = 2 throughout), the likelihood (cf.equation 8 ) becomes, (26) and the expected complete-data log likelihood (cf.equation 10 ) becomes, From a comparison of equations 10 and 27 , it is evident that the calculation of the summary statistics t , u , v , w , z , and r and the updates for μ and σ 2 will all remain unchanged (equations 12 -15 ).However, this simplified presentation obscures that the parameters φ and β are implicitly linked by a function, φ = g ( β; χ), which is indirectly defined by equation 25 as well as by the relationship ω = (1 − φ) α.Thus, the update for β in this case is no longer that shown in equation 11 but now must also consider the terms that depend on φ.As it turns out, in the full multi-RNAP model, rewriting equa-tion 27 in terms of β only (i.e., by substitution for φ) leads to rather unwieldy polynomial expressions.Nevertheless, the M -step in the EM algorithm can be performed numerically without much trouble.In particular, on each iteration of the algorithm, we calculate all expected sufficient statistics as before ( E -step), but then, for the M-step, we estimate β by numerically maximizing the portion of equation 27 that depends on β and φ, that is, subject to the constraints of equation 25 and the relationship ω = (1 − φ) α.Thus, the previous EM algorithm can be adapted to the multi-RNAP steric hindrance case by simply replacing the M -step for β with this numerical optimization.
No other changes are required.

SimPol simulator
SimPol ('Simulator of Polymerases') tracks the independent movement of RNAPs along the DNA templates of a large number of cells (Figure 1 C).It accepts several user-specified parameters, including the initiation rate, pause-escape rate, a constant or variable elongation rate, the mean and variance of pause sites across cells, as well as the center-to-center spacing constraint between RNAPs ( s p ), the number of cells being simulated, the gene length, and the total time of transcription (Supplementary Table S2).The simulator simply allows each RNAP to move forward or not, in time slices of 10 −4 minutes, according to the specified position-specific rate parameters.It assumes that at most one movement of each RNAP can occur per time slice.The simulator monitors for collisions between adjacent RNAPs, prohibiting one RNAP to advance if it is at the boundary of the allowable distance from the next.After running for the specified time, SimPol outputs a file in csv format that summarizes all RNAP positions across cells.An R script (also provided) can then be used to simulate nascent RNA sequencing read counts based on this file.SimPol is written in C++ and makes use of multi-threading to improve performance.A typical simulation (e.g., transcribing a 2kb gene in 20,000 cells for 400,000 steps) requires only a few minutes on a desktop computer.

Generation of synthetic NRS data
Using SimPol, we simulated genes 2000 bp in length with initiation ( αζ) and pause-escape ( βζ) rates that spanned two orders of magnitude, ranging from 0.1 to 10 events per min.per cell.Elongation rates at each nucleotide position were randomly sampled from a truncated normal distribution, with mean = 2000 nt / min, sd = 1000 nt / min, min = 1500 nt / min and max = 2500 nt / min.When a fixed pause site was assumed, it occurred at position k = 50 nt; variable pause sites assumed a truncated normal distribution for k , with mean = 50 nt, sd = 25 nt, min = 17 nt and max = 200 nt.Our main simulations assumed a center-to-center spacing s p = 50 nt, but alternative simulations assumed s p = 33 nt and s p = 70 nt.
For each parameter combination, we simulated 20 000 cells for the equivalent of 40 min (400 000 time slices), which appeared to be sufficient to reach equilibrium in all cases (see also Supplementary Table S2 for a summary).

PAGE 9 OF 20
Based on the output of each SimPol run, we randomly sampled 5000 of the 20 000 cells.This sampling step was performed 50 times for each run.To save in computation, the same source collection of 20 000 cells was used for each replicate, after verifying that it made little difference to rerun the full simulation to equilibrium each time.For each of these replicates, we then sampled a read count at each position i from a Poisson distribution with mean μ i such that, where R i is the number of sampled cells having an RNAP whose active site (center) is at position i and C s = 5000 is the total number of sampled cells.The scale parameter λ c was calibrated such that a typical choice of initiation ( αζ = 1 events / min) and pause-escape ( βζ = 1 events / min) rate parameters resulted in an average read depth of 0.049 reads / bp in the gene body, as we observed in the real data from ref. ( 19 ) (median value).
For simplicity, we performed this step separately at each nucleotide only in the pause region (the first 200 nucleotides).Because the RNAP densities throughout the gene bodies are fairly homogeneous, and the corresponding read counts are averaged anyway, we sampled the total read count for each gene body in one final step by scaling the Poisson distribution appropriately.This strategy also allowed us to extrapolate from our 2,000-bp simulated genes to genes of more realistic length.Specifically, the total read count for the gene body was sampled from a Poisson distribution with mean μ GB such that, where R GB is the total number of RNAPs across the simulated gene body, C s = 5000 is the total number of cells sampled, l targ = 19 800 bp is the target gene-body length, and l sim = 1, 800 bp is the simulated length.Notice that the scale parameter λ c was held fixed throughout so that average read-depths would increase or decrease appropriately as the rate parameters were altered.

Analysis of real data
We obtained published K562 PRO-seq libraries from the heat shock ( 20 ) and celastrol studies ( 19 ) and processed them using the PROseq2.0 pipeline (https:// github.com/Danko-Lab/ proseq2.0) in single-end mode ( 47 ).The 3 ends of readswhich approximately represent the active sites of isolated RNAPs-were recorded in bigWig files and used for analysis.Mapping was performed with human genome assembly GRCh38.p13 and gene annotation were downloaded from Ensembl (release 99) in GTF ( 48 ).Annotations of proteincoding genes from the autosomes and sex chromosomes were used, excluding overlapping genes on the same strand.To improve TSS positioning, we augmented the gene annotations with CoPRO-cap (Coordinated Precision Run-On and sequencing with 5 Capped RNA) data for K562 cells from ref. ( 16 ) (see ( 26 ,49 ) for similar uses of GRO-cap data).In particular, we used the position with highest CoPRO-cap signal within a 250 bp radius around each annotated TSS as a refined TSS and discarded genes for which no CoPRO-cap signal was found.We then considered the 200 bp starting at this refined TSS as the 'pause region,' and the region from 1250 bp to up to 90 kb downstream (but not past the annotated end of the gene) of this TSS as the 'gene body.' W e excluded any gene with fewer than 20 reads mapped to either the pause peak or the gene body.For the remaining genes, the read counts at each of 200 positions in the pause region, and the total read counts in the gene body, were summarized in a table and used for all downstream analyses.

Calculation of half-lives
RNAP half-lives were calculated from estimates of the pauseescape rate β and average values of the pause-site position k by the following equation, where k = 50 nt, ζ = 2000 nt / min, and ˆ β is the estimated pause-escape rate.Here, the quantity ( k − 1 + 1 / ˆ β ) /ζ represents the expected time required for an RNAP to pass through the pause region and the pause site, and the factor log 2 converts the mean of an exponential distribution to a half-life.

Discriminative motif finding
DNA sequences 200 bp downstream of the TSS were extracted, and STREME ( 50 ) was used to identify the motifs enriched in the 10% genes with narrowest pause peaks (smallest ˆ σ 2 ) compared with the 10% of genes with broadest peaks (largest ˆ σ 2 ).Tomtom ( 51 ) was then used to annotate the results with known motifs from HOCOMOCO v11 ( 52 ).To validate the sequence motif enrichment of TAF1, ChIP-seq signals for TAF1 in K562 cells downstream of the TSS were extracted separately for genes exhibiting narrow and broad peaks.These signals were plotted using the R package Genomation ( 53 ).

Calibrating the initiation rate
As described in the Results section, we made use of a 'low' (L) calibration of 0.2 initiation events per minute (54)(55)(56)(57), and a 'high' (H) calibration based on ref. ( 26 ), who reported a median of 1.0 initiation events per minute for mRNAs in K562 cells.These calibrations were performed separately for each of our four analyzed data sets: the untreated and treated samples from the heat-shock ( 20 ) and celastrol ( 19 ) studies.We performed all calibrations using 'housekeeping' (HK) genes from ref. ( 58 ), to minimize sensitivity to differences between assays and other properties.We first identified a subset of genes that fell in the intersection of the HK set, the genes from ( 26 ), and each of our sets.This subset numbered between 1079 and 1082 genes for our four data sets.In each case, for the L calibration, we simply scaled our ω values so that the median value of ω ζ within this set was equal to 0.2 events / min in the control samples.For the H calibration, we scaled our ω values so that the median value of ω ζ within this set matched the median value reported by Gressel et al. ( 26 ) for the same subset of genes, which were 1.52 and 1.53 events / min in the control samples.The ω ζ in treated samples were then scaled based on the ratio of spike-in reads ( 19 ) or total mappable reads ( 20 ) between the control and treated samples.Finally, to focus on genes exhibiting robust expression in both the treated and untreated samples, we identified a subset of genes that fell in the top 80% by ˆ χ in both samples.This filter resulted in a total of 6,182 genes for the heat-shock data set and 5,964 genes for the celastrol dataset.

An evaluation with simulated data reveals strengths and limitations of the initial model
To examine the quality of the maximum-likelihood estimators for the initial probabilistic model (Figure 1 A, B, see Materials and Methods), we developed a fast, flexible computational simulator, called SimPol ('Simulator of Polymerases'), that tracks the progress over time of individual RNAPs across DNA templates in thousands of cells under user-defined initiation, pause-escape, and elongation rates (Figure 1 C).Unlike our assumed model, SimPol tracks potential collisions between RNAPs and prohibits one RNAP from passing another along the template.In addition, it allows for variation across cells in pause-site location and variation across both cells and nucleotide positions in local elongation rate (see Materials & Methods for details).After running to equilibrium, the simulator generates synthetic read counts at each nucleotide position by counting the RNAPs across cells, and then sampling read counts conditional on local RNAP density (see Figure 1 D for an example gene with synthetic data).
Using SimPol, we generated synthetic data sets for a range of plausible initiation and pause-escape rates, with read depths corresponding to those of genes with median expression levels from real data (see Materials & Methods).We performed separate sets of simulations for cases where the pause site is fixed at a known value k and for cases where k is allowed to vary across cells according to a truncated Gaussian distribution.We assumed a mean value of k equal to 50 nt and a standard deviation of 25 nt, approximately as we observe in real data (see Materials & Methods), and required a minimum center-to-center distance between adjacent RNAPs of 50 nt (as in ( 25 )).For each simulated data set, we estimated χ and β from the synthetic data and compared the estimated and true rates.In cases where the pause site was variable, we naively estimated β by using the average read depth across a 200 bp pause peak region, reasoning that such an averaging approach is typically used in computing a pausing index (e.g. ( 19)).However, we considered other estimators as well (see below).In addition, for β, we evaluated the product βζwhich represents the absolute rate of pause-escape-assuming the mean value of ζ = 2 kb / min that was used for simulation.For α, we calibrated our estimates relative to a baseline case simulated with αζ = 1 and no pausing and again assumed ζ = 2 kb / min.In this way, we were able to account for both the sequencing read depth and the elongation rate, and express both the initiation and pause-escape rates in absolute units of events per minute, for ease of interpretation.
We found that these initial estimates of α and β were accurate in some cases but significantly biased in others (Figure 3 , Supplementary Figure S1).In particular, estimates of the initiation rate α were close to the truth when α was not too large and β was not too small; but either a moderately high α or a moderately low β led to a notable downward bias in the α estimates (Figure 3 A).This bias ranged from ∼50% when α was high or β was low but the other parameter was in the favorable regime, to more than an order of magnitude when both parameters were unfavorable.As we explore below, these biases suggest an influence from steric hindrance of new initiation events owing to RNAPs in the pause peak, which would be expected to lead to under-estimation of α precisely when pause-escape rates are low and / or initiation rates are high.Importantly, these biases can be pronounced for plausible values of α and β, suggesting that it can be misleading to treat average read counts in the gene body as a measure of the initiation rate (see Discussion).
By contrast, in the case where the pause site was held fixed and assumed known during estimation, estimates of β were accurate across a range of true α and β values (Figure 3 B; Supplementary Figure S1), indicating that the model describes the dynamics of pause-escape well.When the distance to the pause site, k , was allowed to vary across cells, however, and the readdepth in the pause peak was estimated by averaging across sites, the increased read density owing to pausing was dramatically under-estimated, resulting in a strong upward bias in estimates of β (Figure 3 C; Supplementary Figure S1).In this case, the 'spike' of read counts becomes a rounded 'peak', and using its average height across the region naturally overestimates β.As it turns out, this simple source of bias is deceptively difficult to eliminate.For example, if the maximum, rather than the average, read-depth in the pause peak is used in the estimator for β, the bias is somewhat reduced but remains pronounced (Supplementary Figure S1).A better approach is to use the sum of read counts across the 200 bp pause-peak region (Figure 3 D).Even in this case, however, a downward bias in β is observed when the initiation rate equals or exceeds the pause-escape rate.To remedy this problem, we develop an extension to the model in the next section.

Explicitly modeling pause-site variability across cells improves estimates of pause-escape rates
To address the bias in estimation of β in the presence of variable pause sites, we extended the model to allow for a distribution of pause sites across cells (see Materials & Methods).In this version of the model, the read counts at each site in the pause peak are assumed to arise from a mixture of cells in which an RNAP is and is not paused at that position.If a Gaussian distribution of pause sites is assumed, the model can be fitted to the data relatively simply and efficiently by expectation maximization (EM).This procedure results in maximum-likelihood estimates not only of χ and β but also of the mean and variance of the pause-site position across cells.Thus, this version of the model no longer requires prior information about the pause-site k , but instead allows its mean and variance to be estimated separately at each gene from the raw data.
We reanalyzed our simulated data using this version of the model and found that it was highly effective at correcting the bias in estimated values of β.In this case, the same model addresses the fixed and variable pause-site scenarios equally well across a range of values of α and β (Figure 4 A; Supplementary Figure S2).Whereas the averaging approach described in the previous section resulted in over-estimates of β by more than two orders of magnitude, that bias was completely eliminated when variability in the pause site was modeled directly.Thus, typical average-based estimates of a 'pausing index' ( I P ) for real data may substantially under-estimate the prominence of promoter-proximal pausing in RNAP dynamics (see Discussion).The model addresses this problem.Only when α is quite small and β is quite large (bottom of Supplementary Figure S2), resulting in few excess reads in the pause peak, do the model-based estimates sometimes exhibit a substantial downward bias.Notably, while the full model requires an iterative method such as EM, it is possible to derive an approximate closed-form expression for β in terms of the pausing index I P , To validate the ability of our model to capture differences in the pause-site distributions across cells, we applied it to synthetic data sets with larger and smaller variances in pausesite locations.We found that the model was able to recover the correct distributions fairly accurately, as long as the read counts in the pause peak were not too sparse (Figure 4 B).Notably, the model is provided with no prior information about the mean or variance in the pause-site location, but is only constrained to consider a range of possible values (here from k min = 1 to k max = 200).When applied to real data, the model also appears to do well at identifying plausible distributions across cells in the locations of pause sites (Figure 4 C).

Estimates of pause-escape rates and pause-site distributions for real data
Having observed good performance on synthetic data, we applied the model to published PRO-seq data for K562 cells before and after heat shock ( 20 ), focusing at first on the untreated sample.In comparison to results from the previous version of the model, the distribution of β estimates was substantially shifted downward, as expected from simulations (Figure 4 D).For comparison with experimental results (e.g. ( 25 , 26 , 59 )), we converted these estimated pause-escape rates to half-lives for RNAP residence in the pause region, conditional on an assumed elongation rate of ζ = 2 kb / min ( 18 , 25 , 26 , 59 ).For a direct comparison, we considered the time required for elongation up to the pause site, as well as the waiting time for escape (see Materials & Methods).We estimated a median half-life of 1.4 min.and a mean of 2.8 min.(Supplementary Figure S3A).These estimates are fairly similar to ones obtained experimentally by Gressel et al. for human Raji B cells ( 25 ) (median pause duration of 1.4 min, corresponding to a median half-life of 1.0 min) and for K562 cells ( 26 ) (median pause duration of 1 min), but somewhat smaller than those from some other recent experimental studies (e.g. ( 59); see Discussion).We additionally examined PROseq data from another study of K562 cells ( 19 ) and estimated a similar distribution of half-lives, with a median of 1.2 min.and mean of 2.1 min.(Supplementary Figure S3B).Notably, induction of heat shock and treatment with the drug celastrol both increased half-lives by 2-3-fold (Supplementary Figure S3C,  D), consistent with the observations of increased pause indices in the original reports ( 19 ,20 ).
Interestingly, the pause peaks estimated for both of these data sets varied substantially across genes in their breadth, as quantified by the estimated variance across cells in the pause-site position ( ˆ σ 2 ).Hypothesizing that the underlying DNA sequences might contribute to the precision of pausing, we searched for sequence motifs that distinguished the pause regions having the narrowest peaks (10% with smallest ˆ σ 2 ) from those having the broadest peaks (10% with largest ˆ σ 2 ), focusing again on the untreated samples (see Materials & Methods).In both the heat-shock ( 20 ) and celastrol ( 19 ) data sets, we found that the most strongly enriched motif in the narrow peaks closely matched the binding motif for Data from ( 20 ).In panels A and D, a value of ζ = 2 kb / min is assumed; thus, αζ and βζ can be assumed to have units of events per minute.T A T A-Box Binding Protein Associated Factor 1 (TAF1), the largest subunit of general transcription factor IID (TFIID) and a key component of the pre-initiation complex (PIC) ( 60 ,61 ) (Supplementary Figures S4 and S5, panels A & B).Consistent with these motif enrichments, chromatin immunoprecipitation and sequencing (ChIP-seq) data for TAF1 in K562 cells from the ENCODE project ( 62 ) exhibits a substantially stronger signal at narrow peaks than at broad peaks, particularly near the TSS (Supplementary Figures S4C and S5C).Interestingly, it has recently been shown that the PIC alone is sufficient to establish RNAP pausing, and that rapid TAF1 depletion induces pause-release genome-wide ( 46 ).We found other enriched motifs, but they mostly consisted of repetitive, highly G+C-rich sequences, similar to observations in other recent studies ( 25 , 63 , 64 ).
We were also interested in possible effects of cellular stress on pause-site locations.Metaplots summarizing the accumulated PRO-seq signal across large classes of genes have suggested that pause peaks may tend to shift in location and become sharper following stress, particularly for downregulated genes (e.g., ( 20 ,33 )).To test whether our model supported such a change, we compared our estimates of the pause site locations, before and after application of heat shock (Figure 4 E).Indeed, we observed a striking reduction in both the mean and variance of k after heat shock.This shift may reflect differences in activity of protein complexes required for pausing, including NELF , DSIF , P-TEFb, or the PIC (e.g. ( 46 ,65 )).For comparison, we examined PRO-seq data from the celastrol study, which reported a stress response that resembles heat shock in some ways ( 19 ).In this case, however, we did not observe a clear change in the mean and variance of k (Supplementary Figure S6).

A modeling extension to accommodate steric hindrance in initiation
It seemed likely that the observed under-estimation of α was driven by steric hindrance of initiation from paused RNAPs (Figure 3 A), as has been noted in previous studies of both simulated and real data (24)(25)(26)(27).To ensure that steric hindrance was indeed driving this phenomenon in our setting, we collected two types of auxiliary data from our simulation experiments.First, we tracked the fraction of cells in which the 'landing pad' for a potential new initiation event was already occupied by an RNAP ('landing-pad occupancy'), which is a close proxy in our simulations for the fraction of potential initiation events that were not allowed to occur owing to steric hindrance.We found that this fraction was frequently quite high (often 50% or more, and in some cases ≥90%), and tended to be highest when the bias in estimated α was most pronounced (Supplementary Figure S7A).Second, we measured the rate at which initiation events successfully occurred (not being blocked) and compared it with our estimates of α, finding much better agreement between this 'effective' (sometimes called 'productive' ( 25 ,26 )) initiation rate and our estimates (Supplementary Figure S7B).Finally, we found that our estimation accuracy for α (as measured by the ratio of the estimated to true values) had a close negative correlation with the landing-pad occupancy ( Supplementary Figure S7C).Together, these findings indicate that new initiation events in our simulations are frequently blocked by RNAPs occupying the region just downstream of the TSS, and that this phenomenon explains much, if not all, of the bias in estimation of α.
We therefore extended our probabilistic model to address this problem and permit a quantitative description of steric hindrance of initiation at steady state.Briefly, we introduced a distinction between an effective initiation rate ω and a potential initiation rate α, letting ω = (1 − φ) α, where φ is the landing-pad occupancy.The model assumes that a fraction φ of new initiation events are blocked by RNAPs occupying the landing pad, and the remaining fraction 1 − φ are allowed to proceed (see Materials & Methods for complete details).It allows for more than one RNAP per pause region, because the landing pad will likely be blocked in some cases by RNAPs stacking up behind the pause site.Based on our observations above, we reinterpreted our previous estimator for α instead as an estimator for the effective rate ω (and hence, redefined χ = λ ω ζ ).This new model allows us to compute φ in terms of α, β, k , and an assumed minimum spacing between RNAPs, denoted s p .An extended EM algorithm allows joint estimation of φ, α, and β from the data, with a Beta prior distribution to ensure that φ remains in the allowable range, φ ∈ (0, 1) (Materials & Methods).In effect, this strategy allows us to correct the assumed model for steric hindrance, yielding estimates not only of the effective initiation rate ω and pause-escape rate β, as in the previous case, but also of the landing-pad occupancy φ and, hence, the potential initiation rate α.
When we applied this new model to our simulated data, we found that it generally appeared to work as intended.The estimates of φ were broadly consistent with an empirical measure of the landing-pad occupancy across a range of values of α and β (Figure 5 A), Not surprisingly, given the indirect information about φ in the data, the variance in these estimates was substantial; nevertheless, their mean values were close to the truth even for large φ, although the Beta prior did sometimes produce a slight downward bias in this case.The estimates of the pause-escape rate β remained quite good overall even in the presence of substantial amounts of steric hindrance (Supplementary Figure S8).
The most difficult parameter to recover was the potential initiation rate α, about which the data are only weakly informative via the effective initiation rate ω and the landing-pad occupancy φ.Nevertheless, α was estimated well when α ≤ β and φ ≤ 0.5 (Supplementary Figure S9).When α β and φ → 1, however, the denominator of the estimator α = ω 1 −φ becomes unstable, and in the most extreme cases (e.g.αζ = 10, βζ = 0.1), it is no longer possible to estimate α accurately.In these cases, the effective initiation rate is still estimated well, but it is not possible to extrapolate from it to the potential rate, because only a tiny fraction of potential initiation events are allowed to occur.
Our initial experiments assumed a value of s p = 50 nt for the minimum center-to-center spacing between adjacent RNAPs (following ( 25 )), but the true value of s p is not known with any certainty; therefore, we examined the robustness of our model to other plausible values of this parameter.In general, smaller values of s p will increase the chances that a single paused RNAP will not block the landing pad, but will also allow more RNAPs to stack up behind the pause site.In contrast, when s p grows larger and approaches k , the distance to the pause-site, at most one RNAP can occupy the pause region at a time, and that RNAP will necessarily block the landing pad (see Materials & Methods for details).We used SimPol to generate data sets for two alternative choices of s p : a more generous value of s p = 70 nt and the minimum possible value of s p = 33 nt ( 24 ).The case of s p = 33 nt is probably unrealistic given the physical space required for the other components of the PIC (e.g., ( 45 ,46 )) but is nevertheless useful as a bound.In our simulations, where k has a mean of 50 and standard deviation of 25, ≥2 paused RNAPs are possible at 82%, 54%, and 23% of cells for s p = 33, s p = 50, and s p = 70 nt, respectively; and ≥3 paused RNAPs are possible at 28%, 2.4% and 0.02%, respectively.
The estimates of φ did not differ dramatically as s p was varied, but this sensitivity analysis did reveal two types of bias (Supplementary Figure S10).First, when s p was small and α ≈ β, φ tended to be somewhat overestimated.This overestimation in φ appears to be a consequence of a slight underestimation of β in the presence of multiple RNAPs per pause region (Supplementary Figure S11), which are ignored by our variable pause-site model.Second, we observed a downward bias when αζ and βζ both took the maximum value of 10, apparently because RNAPs in transition between paused positions, which are ignored by the model, make a nonnegligible contribution to landing-pad occupancy in this highly saturated case.This effect is most evident when s p ≥ 50.Overall, however, the φ and β estimates were reasonably accurate across all parameter values, indicating that the model is fairly robust to assumptions about RNAP spacing.

A quantitative characterization of steric hindrance in K562 cells
Using this model we re-analyzed the untreated K562 sample from Vihervaara et al. ( 20 ), focusing on 6,182 robustly expressed genes (the top 80% by ˆ χ).When fitting the model to real data, however, a problem of scale arises: the effective initiation rate ω can only be estimated up to a scale factor, which is determined by the (unknown) ratio of the read depth to the elongation rate, λ/ ζ.At the same time, the pause-escape rate β-whose estimator is a ratio of two summary statistics (equations 7 and 11 )-has a fixed scale; thus, the estimator for the landing-pad occupancy, ˆ φ ≈ ω/β (equations 20 and 25 ), depends on the choice of scale for ω .
To address this problem, we calibrated the scale of ω using published estimates of the initiation rate.Because there is considerable uncertainty about this quantity in the literature, we selected two different calibration points, one on the low end of the reported range for median productive initiation rates in eukaryotes, at 0.2 events per minute ( 54 ,66 ), and another on the higher end, at ∼1.0 events / min (median of estimates reported by ( 26 )).We used a set of 'housekeeping' genes ( 58 ) for calibration, reasoning that these genes would minimize sensitivity to differences among species, cell types, and conditions.We scaled our estimates of ω in the control samples such that the median value of ω ζ within housekeeping genes in our set was either equal to 0.2 (the low (L) calibration) or equal to the median value in matched housekeeping genes from ref. ( 26 ), which was 1.5 events / min (the high (H) calibration; see Materials & Methods for details).
In the case of the L calibration, we found that the distribution of φ estimates across genes spanned a fairly broad range, with a median of 0.16 and a mean of 0.25 (Figure 5 B), suggesting that, on average, about a quarter of landing pads are occupied at steady state, but considerable fractions of genes have substantially lower or higher occupancies.About 5% of genes were predicted to have their landing-pads fully occupied ( φ ≥ 0.95; Figure 5 C).The scaled estimates of the effective initiation rate ω ζ were roughly exponentially distributed, with a median of 0.17 events per minute per cell (Figure 5 D).By combining our estimates of φ and ω ζ, we were additionally able to obtain per-gene estimates of the potential initiation rate αζ.These estimates were modestly inflated, with a median of 0.22 and ∼3% of values ≥2.0 events per minute (Figure 5 D).Thus, steric hindrance appears to result in a reduction in the effective initiation rate in these cells, but the effect is fairly subtle in this condition.
When we assumed the H calibration, the φ distribution shifted accordingly to the right.Indeed, in this case 76.3% of genes were predicted to have fully occupied landing pads ( φ ≥ 0.95; Figure 5 C), and as a result, the majority of αζ values became poorly defined, suggesting that this initiation-rate calibration is perhaps too aggressive.Nevertheless, by considering a broad range of potential initiation-rate calibrations, we estimated that the landing-pad occupancy is generally fairly substantial, with the median value of φ ranging from 16% to full occupancy.Thus, our model predicts that steric hindrance has a substantial impact on initiation rates at steady state, even in the untreated condition.
For comparison, we repeated the analysis with the treated sample, following heat shock (HS).Consistent with observations of increased pausing after HS ( 20 ,26 ), we found a dramatic shift toward larger φ estimates in this sample (Figure 5 B).With the L calibration, the fraction of genes with fully occupied landing pads ( φ ≥ 0.95) increased from 5.4% to 21.5% (Figure 5 C).Accordingly, steric hindrance was predicted to have a substantially stronger impact on the effective initiation rate after HS, decreasing from a median of 0.25 events / min to 0.13 events / min (Figure 5 E).We also analyzed untreated and treated samples from the K562 / celastrol study ( 19 ) using 5,964 genes with initiation rates calibrated using similar L and H strategies (Materials & Methods).The results were qualitatively similar to those from the heat-shock analysis, but the absolute φ estimates were somewhat lower, shifting from a median of 0.13 in the untreated case to 0.30 after treatment (L calibration; Supplementary Figure S12A).Application of celastrol resulted in a striking increase in fully occupied landing-pads, from 2.7% to 14.8% of genes under the L calibration, but as in the heat shock analysis, the H calibration indicated nearly complete landing-pad occupancy before treatment and a limited shift after treatment (Supplementary Figure S12B).Prediction of αζ indicated a modest reduction in initiation rates from steric hindrance before treatment (Supplementary Figure S12C) and a more pronounced one after treatment (Supplementary Figure S12D).Overall, we found that steric hindrance has a clear impact on productive initiation, and that impact is particularly striking during responses to cellular stress.

Characterization of heat-induced and heat-repressed genes
The previous analysis grouped all genes together, potentially obscuring important differences among up-and downregulated genes.Therefore, we re-examined the K562 heatshock data from Vihervaara et al. ( 20 ), this time focusing on groups of genes designated in that study as heat-induced (503 genes) or heat-repressed (4399 genes).We first compared estimates of the effective initiation rate ( ω ζ) and the pause-escape rate ( βζ) for these two sets of genes, separately considering the untreated, or non-heat shock (NHS), and heat-shock (HS) conditions.As expected, the model did find that, under HS, the effective initiation rates of designated heat-repressed genes significantly decreased, whereas those of designated heat-induced genes significantly increased (Supplementary Figure S13A).In addition, we found that heatrepressed genes showed a significant decrease in the pauseescape rate under HS, suggesting that promoter-proximal pausing contributes to down-regulation of these genes (Supplementary Figure S13B) as concluded by the authors ( 20 ) (see also ( 21 )).By contrast, the heat-induced genes showed relatively little change in the pause-escape rate under HS.While increased rates of pause-escape are known to be important in the HS response ( 67 ), initiation rates have also been reported to increase at up-regulated genes ( 20 ,22 ).Our observations suggest that the increased initiation rates are more durable, and dominate once a new equilibrium is reached.We carried out a parallel analysis of the untreated and treated samples from the K562 / celastrol study ( 19 ) with similar results (Supplementary Figure S14).
Even among up-or down-regulated genes, changes in gene expression can in principle be driven more at the transcription-initiation stage or at the pause-escape stage.Our steady-state model cannot fully reconstruct the dynamics of a change in transcriptional output, but we hypothesized that it would, in at least some cases, provide an indication of which stage had been dominant.To address this question, we searched for examples of genes that fell in each of four distinct classes: (A) down-regulated primarily at pause-escape; (B) down-regulated primarily at initiation; (C) up-regulated primarily at pause-escape; or (D) up-regulated primarily at initiation.The normalized PRO-seq read counts, together with model-based estimates of the effective initiation rate ( ω ζ) and the pause-escape rate ( βζ) for two examples of each class are shown in Supplementary Figure S15A-D.In each case, the PRO-seq signal in the gene body indicates clear up-regulation or down-regulation, but the behavior at the pause peak relative to the gene body indicates differences in the primary driver of the change in expression.For example, the gene EMC2 , which is down-regulated nearly four-fold in response to heat shock (Supplementary Figure S15A), shows a marked increase in the PRO-seq signal at the pause peak relative to the gene body, leading to a 15-fold decrease in its estimated pause-escape rate, suggesting that pause-escape is the driver of down-regulation.By contrast, the gene AGL , which is downregulated about two-fold, shows almost no change in the estimated pause-escape rate (Supplementary Figure S15B), suggesting that down-regulation is driven by initiation.These examples demonstrate that our steady-state model can identify cases where changes in expression are likely to have been driven by initiation or pause-escape.Future extensions to nonequilibrium models and time-course data will improve this type of analysis.

Discussion
The widespread occurrence of promoter-proximal pausing has been one of the major surprises of the past ∼15 years in the study of gene regulation.Most studies of this phenomenon have focused on its molecular mechanisms and its direct impact on rates of productive elongation ( 17 ).In addition, however, there have been indications that such pausing, when sufficiently pronounced, also imposes indirect limits on rates of transcription initiation (24)(25)(26)(27).In this article, we have shown through analysis of simulated and real data that many as-pects of this complex interplay between initiation and pauseescape can be captured by relatively simple probabilistic models.
Our models not only describe raw read counts as a function of variable rates of initiation and pause-release, enabling estimation of these rates from NRS data, but they also allow for variable pause sites across cells, for reductions in initiation rates via steric hindrance, and for the effects of multiple RNAPs stacking up behind the pause site.We have shown by simulation that both variable pause sites and steric hindrance can have major impacts on the estimated rate parameters.Our analysis of real data indicates not only that pause sites tend to be highly variable, but that their degree of variability is correlated with particular sequence motifs as well as with stress responses.Similarly, steric hindrance of transcription initiation appears to occur at many genes and is intensified in stress responses.Both of these phenomena appear to play major roles in shaping the patterns of aligned reads from NRS data, particularly near the 5 ends of transcription units.We have focused in our initial work on applications to PRO-seq data, where read counts can be assumed to reflect the density of engaged, transcriptionally competent RNAPs.Additional work will be required to see if our methods can be adapted for use with protocols that more broadly identify chromatin-associated RNAs or Pol II molecules, such as mNET-seq or caRNA-seq (reviewed in ( 35 )).
Our analysis shows that, even at steady-state, it is possible to obtain accurate estimates of relative initiation and pauseescape rates.If, in addition, separate estimates are available of the average elongation rate per gene, then the initiation and pause-escape rates can be expressed in absolute terms as numbers of events per cell per unit time.In turn, these estimated rates can be used to obtain various downstream quantities of interest, such as pausing half-lives (Supplementary Figure S3), landing-pad occupanies (Figure 5 B, C, Supplementary Figure S12A, B), or, through combination with other data sources, half-lives of RNA molecules or related quantities ( 68 ).Even when estimated elongation rates are not available, it may be adequate for some applications to estimate 'ballpark' rates based on the average rate of elongation across genes, which appears to be fairly stable at ∼2 kb / min in mammalian cells ( 18 , 25 , 26 , 59 ).We have adopted this approach here in approximating absolute initiation rates, landing-pad occupancies (Figure 5 ; Supplementary Figure S12), and pausing halflives (Supplementary Figure S3) from publicly available NRS data.We should emphasize, however, that our per-gene estimates are only approximate and may be biased by differences across genes in elongation rate.
With these caveats, our estimates of the half-lives of paused RNAPs-with mean values of 2-3 min.in untreated K562 cells and ∼6 min.after heat shock or celastrol treatment (Supplementary Figure S3)-agreed reasonably well with previous experimental estimates.For example, Jonkers et al. measured a mean half-life of 6.9 min.at ∼3200 genes in mouse ES cells ( 59 ); Shao et al. found most half-lives were between 5 and 20 min at 2300 genes in Drosophila Kc167 cells ( 27 ); and Gressel et al. estimated a median pause duration of 1.4 min.(corresponding to a half-life of 1.0 min.)for 2135 genes in human Raji B cells ( 25 ).(The same group subsequently estimated a similar rate for 6355 protein-coding genes in K562 cells ( 26 ).)In addition, Henriques et al. reported half-lives of promoter Pol II complexes from Drosophila S2 cells of ∼2-15 min at selected loci ( 69 ).However, differences in species, PAGE 17 OF 20 cell types, treatments and sets of genes make these estimates difficult to compare precisely .Notably , the largest estimates ( 27 ,59 ) were obtained after treatment with triptolide, which could potentially lead to some inflation in the estimates ( 31 ).Despite these differences, our method seems to be in general agreement overall with previous findings that paused Pol II is typically stable for minutes, and sometimes for tens of minutes, although we detect fewer extremely long half-lives than some previous studies.It is likely that our estimator for β reaches saturation when pause peaks become unusually elevated, leading to reduced sensitivity for the extreme tail of the distribution of pausing half-lives.At the same time, our estimator may be more sensitive to short half-lives than some experimental methods.
Despite such limitations, the use of steady-state rather than time-course data also has some important advantages.This approach allows us to analyze all expressed genes, not just a subset at which expression can be induced.In addition, it requires no chemical treatment to block initiation or pause escape, and therefore avoids potential off-target effects.At the same time, it is worth noting that our framework can be extended to the nonequilibrium setting and the use of time-course data (see bioRxiv: https:// doi.org/ 10.1101/ 2021.01.12.426408 ).In that setting, it could be used to infer genespecific elongation rates together with the other quantities.This version of the model, however, is considerably more complicated mathematically, more difficult to fit to data, and harder to interpret.Additional work will be required before it can be applied to real data.
Our estimator for the pause-escape rate β is closely related to the quantity known as the 'pausing index' ( I P ), which is frequently used to measure the prominence of promoterproximal pausing ( 14 ,44 ).In the case where the pause site is constant across cells, we have shown that I P is precisely the inverse of the maximum likelihood estimator for the rate parameter β.When the pause site varies across cells, however, it becomes less straightforward to interpret I P in physical terms, and a naive interpretation may lead to a strongly biased characterization of the pause-escape rate (e.g. Figure 4 D).The details of how I P is calculated-e.g. by averaging the read counts in the pause region or using their maximum value (Figure 3 C and Supplementary Figure S1)--also become important in this case.We have shown that our model has a natural extension to variable pause sites, which permits estimation of both the rate parameter β and the distribution of pause sites per gene by expectation maximization.In this setting, β no longer has a simple relationship to I P .Interestingly, however, it is possible to find an approximate closed-form relationship in the case where I P is calculated by averaging, namely, 1 ˆ β ≈ L ( I P − 1 ) , where L = k max − k min + 1 is the length of the pause region (see Materials & Methods, equation 17 ).We anticipate that this relationship may help to standardize definitions of I P and clarify its physical meaning.
Our model also has implications for how to interpret the average read depth in the gene body ( ˆ χ in our notation), which is a natural measure of the transcriptional output at each gene.The model makes clear-as other investigators have previously argued ( 25 ,26 )-that ˆ χ is a measure of the rate of 'effective' or 'productive' transcription initiation at steady state, rather than of the 'potential' rate.That is, in our notation, χ is proportional to ω rather than α.Moreover, in the presence of steric hindrance, ω ≈ φβ (equation 19 ; see also equation 25 ), which means that ω (and hence χ) is effectively limited by the pause-escape rate β.Notably, this equation provides a simple and interpretable characterization of the 'pause-initiation limit' described by Gressel et al. ( 25 ,26 ).It says that the effective initiation rate ω can never be greater than the pauseescape rate β, with equality when the landing-pad is fully occupied ( φ = 1).The physical meaning of the 'potential' initiation rate α is somewhat less clear but, at least when φ is not too close to one, our model does allow extrapolation to a larger rate at which initiation events would hypothetically occur in the absence of steric hindrance.These values can be contrasted with the effective initiation rate in assessing the importance of steric hindrance.
We should emphasize that our model does depend on some crucial simplifying assumptions.Perhaps the most restrictive of these is the assumption that any initiating RNAP will eventually make its way along the entire DNA template, with no possibility of premature termination (PT).Similarly, we assume that the elongation process will not be blocked or substantially reduced by exogenous factors, such as UV-induced DNA damage.It seems likely that PT does occur to a degree, although its prevalence remains controversial.Several investigators have argued that the dynamics of RNAP occupancy near the promoter is dominated by pausing, rather than PT, because promoter-associated RNAP complexes are quite stable and predominantly localize to the pause region ( 27 , 59 , 69 ) (reviewed in ( 17 )).However, others have argued, based on imaging or footprinting techniques, that PT may play a much more prominent role ( 31 ,70 ).Even if PT does occur at appreciable rates, its effect on our estimates will depend on precisely where and when it occurs.For example, if PT tends to occur immediately after initiation, it may have little effect on our estimates of the effective initiation or pause-escape rates.Indeed, if it occurs within a few nucleotides of the TSS, it may be largely invisible to NRS assays.If, on the other hand, PT tends to occur farther into the pause region, it will tend to influence estimates of the pause-escape rate-causing them to be too low (because some apparently paused RNAPs are actually lost to PT)-but not those of the effective initiation rate.We cannot see any way of fully incorporating PT in a probabilistic model based on NRS data or any other readout of RNAP density-even one that applies to the nonequilibrium setting-without relying on external estimates of the position-dependent rate at which PT occurs; otherwise it will necessarily be confounded with other rates.Until such estimates become available, model-based estimates of the pause-escape rate, in particular, must be considered with this caveat in mind.
A second process we ignore here is 'bursting' of transcription initiation.There is now substantial evidence that rates of transcription initiation do exhibit considerable temporal variation in actively transcribed genes, with apparent oscillations between active and inactive states ( 28-30 , 32 , 71 , 72 ).This pattern is strictly inconsistent with our assumption of a time-homogeneous Markov process.It is worth bearing in mind, however, that our model effectively averages over the processes that are occuring in a large population of cells.Unless those cells are synchronized in their bursting behavior, it would seem that averaging over 'on' and 'off' states in a bursting model, as our Markov model will effectively do, may be adequate for our purposes-although it is true that bursting could contribute to additional overdispersion in our read-count data.

Figure 1 .
Figure 1. ( A ) Conceptual illustration of model, focusing on the kinetic model for RNAP movement on the DNA template.Gray arrow indicates that a second la y er of the model describes generation of nascent RNA sequencing ( NRS ) read counts based on the distribution of RNAP positions across cells.( B ) Graphical model representation with unobserved continuous-time Markov chain ( Z i ) and observed read counts ( X i ) .Read counts at each site X i are conditionally independent and Poisson-distributed given mean μ i , which reflects both the density P ( Z i ) and the sequencing depth λ. ( C ) Design of SimPol ( 'Simulator of Polymerases' ) .Based on user-defined initiation, pause-escape, and elongation rates, SimPol tracks the mo v ement in silico of RNAPs across N -bp DNA templates in C cells, then samples synthetic read counts based on RNAP positions.SimPol identifies collisions and prohibits RNAPs from passing one another.It also models variable pause sites and elongation rates.( D ) Example of synthetic nascent RNA sequencing data from SimPol, shown in IGV ( 74 ) alongside matched real PRO-seq data from ( 19 ) for the DNAJA1 gene on chromosome 9 of the human genome.

Figure 2 .
Figure 2. ( A ) Two-state continuous-time Markov model for steric hindrance of transcriptional initiation, assuming at most one RNAP at a time in the pause region.The pause region must be either unoccupied (state 0) or already occupied by another RNAP (state 1).Transitions from state 0 to state 1

PAGE 11 OF 20 Figure 3 .
Figure 3. Accuracy of estimated values of the transcription initiation rate α and pause-escape rate β under the initial version of the model.Estimates are expressed as products with the elongation rate ζ ( αζ and βζ).( A ) Simulated true vs. estimated values of αζ, for αζ ∈ {0.1, 1, 10} (left to right) and βζ ∈ {0.1, 1, 10} (see k e y).Dashed lines indicate the ground truth.(B-D).Estimated values of βζ for simulated true values of βζ = 1 and αζ ∈ {0.1, 1, 10} (see k e y), when the pause-site k is fixed ( B ) or variable across cells ( C, D ) in simulation.In panel C, β is estimated using the average read-depth in the pause peak, and in panel D it is estimated using the sum of read counts across the region.Dashed lines indicate the ground truth.Results for other values of βζ are shown in Supplementary Figure S1.All boxplots summarize 50 replicates of the simulation; box boundaries indicate 1st and 3rd quartiles, and horizontal line indicates median.A value of ζ = 2 kb / min is assumed; αζ and βζ can be assumed to have units of events per minute.Pause sites occur at a mean position of k = 50 nt.In the variable case, we assume a Gaussian distribution with a standard deviation of 25 nt.

Figure 4 .
Figure 4. ( A ) Accuracy of estimated values of the pause-escape rate β under the version of the model that allows for a distribution of pause-sites k across cells.Shown are estimated values of βζ for simulated true values of βζ = 1 and αζ ∈ {0.1, 1, 10} (see k e y), when the pause-site k is fixed (left) or variable across cells (right) in simulation.Dashed lines indicate the ground truth.Results for other values of βζ are shown in Supplementary Figure S2.All boxplots summarize 50 replicates of the simulation; box boundaries indicate 1st and 3rd quartiles, and horizontal line indicates median.Simulated pause sites occurred at a mean position of k = 50 nt.In the variable case, we assumed a Gaussian distribution with a standard deviation of 25 nt.( B ) Examples of pause peaks in simulated data, showing assumed distribution of pause sites (blue dashed line) and distribution inferred by expectation maximization (red solid line).( C ) Similar examples from real data from ( 20 ). ( D ) Estimates of βζ under the original a v eraging approach (horizontal axis) vs. estimates of βζ under the model that allows for variable k across cells (vertical axis).( E ) Contour plot showing the distribution of estimated means (horizontal axis) and standard deviations (vertical axis) of the pause peak position k , under the 'no heat shock' (NHS) and 'heat shock' (HS) conditions.

PAGE 14 OF 20 NucleicFigure 5 .
Figure 5. ( A ) Accuracy of estimated landing-pad occupancy φ under the version of the model that allows for steric hindrance in initiation and multiple RNAPs per pause region.Scatter plots show the fraction of simulated cells for which the first 50 nt (the 'landing-pad') are occupied by an RNAP at steady state ('Empirical φ') vs. the fraction predicted to be occupied under the model ('Estimated φ') based on the simulated NRS data, assuming a minimum spacing of s p = 50 nt.Results are shown for simulated true values of αζ ∈ {0.1, 1, 10} (left to right) and βζ ∈ {0.1, 1, 10} (see k e y), with 50 simulations per parameter combination.Dashed line indicates y = x , and colored crosses represent the means of the corresponding points.A value of ζ = 2 kb / min is assumed, so that αζ and βζ are in e v ents per minute.( B ) Distribution of estimated φ for 6182 robustly expressed genes in K562 cells before (NHS) and after (HS) heat shock under the low (L) calibration ( 20 ) (see Materials & Methods for details).( C ) Percentages of genes having fully occupied landing-pads ( φ > 0.95) before (NHS) and after (HS) heat shock, under the low (L) and high (H) calibrations.( D, E ) Distributions of scaled estimates of the 'effective' ( ω ζ) and 'potential' ( αζ) rates of transcription initiation, in events per minute per cell, for the same genes.Panel D represents the NHS case and panel E represents the HS case.The x -axes are truncated to highlight the bulk of the distributions.Gray arrows indicate effects of steric hindrance.

PAGE 16 OF 20 Nucleic
Acids Research , 2023, Vol.51, No. 21, e106 Assume that, for k ∈ { k min , …, k max }, f k represents the fraction of cells with pause site k , and that the read count X k derives from a mixture of one Poisson distribution with rate χ/ β • f k and a second Poisson distribution with rate χ • (1 − f k ).If we denote by Y k the (unknown) portion of the read count that derives from the